Today we’re going to look at how we can use joins and geospatial data more explicitly to make maps. Maps are among the first data visualizations that ever occurred, and some of the most powerful. They’re also one of the places where joins become incredibly important, as to put data on a map we have to join our data with a geospatial description of the map we want.
Today’s data set is heart disease mortality data from the CDC.
#load the data and prep
#for some data manipulation
library(readxl)
library(dplyr)
heart_disease <- read_excel("./data/hd_all.xlsx",
na="Insufficient Data")
head(heart_disease)
## # A tibble: 6 x 4
## State County Death_Rate FIPS_Code
## <chr> <chr> <dbl> <dbl>
## 1 Alabama Autauga 463 1001
## 2 Alabama Baldwin 391 1003
## 3 Alabama Barbour 533 1005
## 4 Alabama Bibb 511 1007
## 5 Alabama Blount 426 1009
## 6 Alabama Bullock 483 1011
OK, we see that we have state, county, and heart disease death rates. FIPS codes, FYI, are standardized county identifiers. We’ll be ignoring them.
There are a LOT of ways to get map data into R. We’re going to begin with the simplest - grabbing it from an R package. ggplot2 works in tandem with the maps package to provide a few standardized sets of maps for easy plotting. Let’s take a look at a map of counties in the U.S. lower 48.
#install the maps package if you
#don't have it
library(ggplot2)
#map_data gets us one of the select maps
map_df <- map_data("county")
head(map_df)
## long lat group order region subregion
## 1 -86.50517 32.34920 1 1 alabama autauga
## 2 -86.53382 32.35493 1 2 alabama autauga
## 3 -86.54527 32.36639 1 3 alabama autauga
## 4 -86.55673 32.37785 1 4 alabama autauga
## 5 -86.57966 32.38357 1 5 alabama autauga
## 6 -86.59111 32.37785 1 6 alabama autauga
OK, we have latitude and longitude of county borders, a group (each county polygon has one group), and both a region and subregion - which hold the state and county names. Note the columns aren’t called State and County, and capitalization is wonky - it’s all lower case. Our heart disease data, by contrast, also includes cities and US Territories that this lower-48 county map doesn’t cover.
To show you how we would use this data:
ggplot(data=map_df, mapping = aes(x = long, y = lat, group = group)) +
geom_polygon()
geom_polygon is our newest friend, and it draws polygons from the lat/long paths above. There are other ways to use this map, but this is the clearest.
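If the group aesthetic is new to you, here’s a minimal sketch with two toy squares (made-up coordinates, nothing to do with the county map) showing how group controls which vertices get connected into a single polygon:

```r
library(ggplot2)

#two toy squares; group tells geom_polygon which
#vertices belong to the same shape
toy <- data.frame(
  x = c(0, 1, 1, 0, 2, 3, 3, 2),
  y = c(0, 0, 1, 1, 0, 0, 1, 1),
  group = rep(c("a", "b"), each = 4)
)

ggplot(data = toy, mapping = aes(x = x, y = y, group = group)) +
  geom_polygon()
```

Drop group and all eight points get strung together into one self-intersecting shape - exactly what happens to a map plotted without group = group.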
OK, so, we want to see how death from heart disease varies by county! To do this, we’ll need to join the two data frames. Note, the number of rows in map_df is YUGE compared to those in our data set, as they define county borders. So we’re going to be merging one data point to many rows in the map_df data frame. But, we need to do some prep work first.
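Here’s that one-to-many behavior in miniature, with fake county names and rates (not our real data): each county’s single value gets copied onto every one of its border rows.

```r
library(dplyr)

#one value per fake county
county_values <- data.frame(County = c("a", "b"),
                            Death_Rate = c(400, 500))

#three border points per fake county
county_borders <- data.frame(County = rep(c("a", "b"), each = 3),
                             long = 1:6, lat = 1:6)

#each value is repeated across all of its county's border rows
joined <- inner_join(county_borders, county_values, by = "County")
nrow(joined)
## [1] 6
```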
First, we need to fix up the columns in our heart_disease data set to not be capitalized, and we need to rename those columns in our map data frame.
map_df <- map_df %>%
rename(State = region,
County = subregion)
heart_disease <- heart_disease %>%
mutate(State = tolower(State),
County = tolower(County))
Now let’s test out the join! Before we do the big join, let’s look at mismatches. This is the perfect job for anti_join.
bad_match <- anti_join(heart_disease, map_df)
## Joining, by = c("State", "County")
nrow(bad_match)
## [1] 112
head(bad_match)
## # A tibble: 6 x 4
## State County Death_Rate FIPS_Code
## <chr> <chr> <dbl> <dbl>
## 1 alabama dekalb 524 1049
## 2 alabama st. clair 422 1115
## 3 alaska aleutians east 106 2013
## 4 alaska aleutians west 217 2016
## 5 alaska anchorage 253 2020
## 6 alaska bethel 359 2050
Uh oh! 112 rows of bad matches! Why? Well, we can see one of the reasons quickly - the US Virgin Islands. They are not in the map data set. Second, though, we can see st. clair - indeed, the map data frame has no . or ’ characters in it, so we should strip those out of heart_disease and then re-check.
heart_disease <- heart_disease %>%
mutate(County = gsub("\\.", "", County)) %>%
mutate(County = gsub("\\'", "", County))
bad_match2 <- anti_join(heart_disease, map_df)
## Joining, by = c("State", "County")
nrow(bad_match2)
## [1] 82
bad_match2
## # A tibble: 82 x 4
## State County Death_Rate FIPS_Code
## <chr> <chr> <dbl> <dbl>
## 1 alabama dekalb 524 1049
## 2 alaska aleutians east 106 2013
## 3 alaska aleutians west 217 2016
## 4 alaska anchorage 253 2020
## 5 alaska bethel 359 2050
## 6 alaska bristol bay NA 2060
## 7 alaska denali 272 2068
## 8 alaska dillingham 382 2070
## 9 alaska fairbanks north star 269 2090
## 10 alaska haines 265 2100
## # ... with 72 more rows
OK, much better - down to 82. And looking at what’s left, those are cities and places outside the lower 48 - like Alaska - that aren’t in our county map anyway, so they should not affect our map making.
Are we good the other way?
bad_match3 <- anti_join(map_df, heart_disease)
## Joining, by = c("State", "County")
#because anti_join returns a data frame of border points
length(unique(bad_match3$County))
## [1] 8
unique(bad_match3$County)
## [1] "de kalb" "washington" "de soto"
## [4] "du page" "la porte" "yellowstone national"
## [7] "la moure" "de witt"
Only 8. Were we trying to make this perfect, we’d try and figure out why, but, for now, let’s soldier on. 8 is acceptable.
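For the record, the likely culprit for most of these is spacing: the map spells names like de kalb with a space, while the heart disease data has dekalb (we saw that one in bad_match2). Assuming the other pairs follow the same pattern - an assumption worth verifying against the real data - stripping whitespace from both County columns would reconcile them:

```r
#map-side spellings vs. (assumed) heart-disease-side spellings
map_names  <- c("de kalb", "de soto", "la porte")
data_names <- c("dekalb", "desoto", "laporte")

#removing spaces makes the two sets agree
all(gsub(" ", "", map_names) == data_names)
## [1] TRUE
```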
So we know that there are some territories and cities left in the heart disease data, and we don’t want to worry about them for the moment. Given that we’ve got some mismatch on both sides, we want to use an inner_join - although we could also use full_join so that the entire map is retained, and thus see which counties we are missing data for. Try it both ways.
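If the difference between those joins is fuzzy, here’s the contrast on two toy data frames (made-up ids, not our map data); full_join is dplyr’s name for an outer join:

```r
library(dplyr)

left_df  <- data.frame(id = c(1, 2, 3), x = c("a", "b", "c"))
right_df <- data.frame(id = c(2, 3, 4), y = c("B", "C", "D"))

#inner_join keeps only ids present in both
nrow(inner_join(left_df, right_df, by = "id"))
## [1] 2

#full_join keeps everything, padding the gaps with NA
nrow(full_join(left_df, right_df, by = "id"))
## [1] 4
```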
heart_disease_map_data <- inner_join(heart_disease, map_df)
## Joining, by = c("State", "County")
And now - let’s plot it!
heart_map <- ggplot(data=heart_disease_map_data,
mapping = aes(x = long, y = lat, group = group,
fill = Death_Rate)) +
geom_polygon()
heart_map
So, this is a choropleth map. There are some issues - the scale is hard to resolve, and everything is hard to see. There are a lot of ways we could rectify it. Here are a few tricks.
First, a better scale. Maybe a gradient?
heart_map +
scale_fill_gradient(low = "white", high = "red")
OK, not bad! Not great, but much better. Are there gradients you’d prefer?
However, the bigger problem is that we have a huge range to cover. One common way to make choropleths is to bin data into categories and go from there. Binning in combination with a nice color scale - say via Color Brewer - can be great. Remember Color Brewer? http://colorbrewer2.org/ actually uses maps as its demo! And there’s a scale_fill_brewer function in ggplot2.
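If cut_interval is new to you, it’s a ggplot2 helper that chops a numeric vector into n equal-width bins and returns a factor. A quick standalone look on toy rates:

```r
library(ggplot2)

death_rates <- c(100, 250, 320, 480, 500)

#five equal-width bins spanning the range of the data
bins <- cut_interval(death_rates, 5)
nlevels(bins)
## [1] 5
```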
So let’s make bins and plot that instead!
heart_map_binned <- ggplot(data=heart_disease_map_data,
mapping = aes(x = long, y = lat, group = group,
fill = cut_interval(Death_Rate, 5))) +
geom_polygon()
heart_map_binned +
scale_fill_brewer(palette="Reds")
MUCH nicer. Now we can really start to see some hot spots.
For our faded examples, today we’re going to look at data on TB incidence from the World Health Organization. This data set looks at TB for the entire planet at the country level. Let’s take a look.
library(readr)
tb_world <- read_csv("./data/who_tb_data.csv")
tb_world
## # A tibble: 3,651 x 18
## country iso2 iso3 g_whoregion year estimated_popul… est_incidences_…
## <chr> <chr> <chr> <chr> <int> <int> <dbl>
## 1 Afghan… AF AFG EMR 2000 20093756 190
## 2 Afghan… AF AFG EMR 2001 20966463 189
## 3 Afghan… AF AFG EMR 2002 21979923 189
## 4 Afghan… AF AFG EMR 2003 23064851 189
## 5 Afghan… AF AFG EMR 2004 24118979 189
## 6 Afghan… AF AFG EMR 2005 25070798 189
## 7 Afghan… AF AFG EMR 2006 25893450 189
## 8 Afghan… AF AFG EMR 2007 26616792 189
## 9 Afghan… AF AFG EMR 2008 27294031 189
## 10 Afghan… AF AFG EMR 2009 28004331 189
## # ... with 3,641 more rows, and 11 more variables:
## # est_incidences_per_100K_people_lwr <dbl>,
## # est_incidences_per_100K_people_upr <dbl>, est_incidences <int>,
## # est_incidences_lwr <int>, est_incidences_upr <int>,
## # est_mortality_per_100K_people <dbl>,
## # est_mortality_per_100K_people_lwr <dbl>,
## # est_mortality_per_100K_people_upr <dbl>, est_mortality <int>,
## # est_mortality_lwr <int>, est_mortality_upr <int>
There’s a lot going on here - in particular, let’s focus on estimated incidences, incidences per 100K people, and mortality. upr and lwr denote upper and lower bounds of the estimates.
So, a simple map of 2015.
#get a map
worldmap <- map_data("world")
#filter to 2015
tb_2015 <- tb_world %>% filter(year == 2015)
#join
incidence_df <- left_join(worldmap, tb_2015, by = c("region" = "country"))
#plot
ggplot(incidence_df, aes(x = long, y = lat,
group = group, fill=est_incidences)) +
geom_polygon() +
scale_fill_gradient(low = "blue", high = "red")
Huh - not everything joins well, but we’ll ignore that for now.
Let’s look at mortality in 2000
#get a map
worldmap <- map_data("____")
#filter to 2000
tb_2000 <- tb_world %>% filter(year == ____)
#join
incidence_df_2000 <- left_join(worldmap, ____, by = c("region" = "country"))
#plot
ggplot(incidence_df_2000, aes(x = ____, y = ____,
group = group, fill=est_mortality)) +
geom_polygon() +
scale_fill_gradient(____)
Now let’s look at incidence per 100K, but only by interval, in 2016 - feel free to do something different with the fill.
#get a map
worldmap <- ____("____")
#filter to 2016
tb_2016 <- tb_world %>% ____(year == ____)
#join
incidence_df_2016 <- left_join(worldmap, ____, by = c("region" = "country"))
#plot
ggplot(incidence_df_2016, aes(x = ____, y = ____,
group = ____,
fill=____(est_incidences_per_100K_people,5))) +
____() +
scale_fill_brewer(palette = "YlOrBr",
guide = guide_legend("Incidences per 100K"))
What does not join well? Can you fix any of them via changes in the tb or worldmap dataset?
Can you make a map showing change in mortality over four years (filter to four years, and then use facet_wrap)?
Most map data comes in files separate from R. You’ll most commonly see shapefiles - files that define borders. Shapefiles are funky things. Let’s take a look at the shapefile for a common classification of marine ecoregions of the world. These are coastal regions binned by biogeographic boundaries. To load them, we’ll need the sp and rgdal libraries. These might take some doing to install properly, but it’s worth it.
library(sp)
library(rgdal)
ecoregions <- readOGR("./data/MEOW-TNC/meow_ecos.shp", layer="meow_ecos")
## OGR data source with driver: ESRI Shapefile
## Source: "./data/MEOW-TNC/meow_ecos.shp", layer: "meow_ecos"
## with 232 features
## It has 9 fields
plot(ecoregions)
Neat - these kinds of files can be plotted on their own! But, we’re going to take this a bit further.
So, if we dig into this object, what do we see?
class(ecoregions)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
#It's an S4 object
slotNames(ecoregions)
## [1] "data" "polygons" "plotOrder" "bbox" "proj4string"
So, it’s a SpatialPolygonsDataFrame. Weird. And it has all of these pieces - called slots - as it is what we call an S4 object. More on that another time, but suffice it to say, instead of the $ notation, we use @ notation to access its pieces. Such as:
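As a tiny illustration - a made-up Point class, nothing to do with sp - here’s S4 slot access in miniature:

```r
library(methods)

#a minimal S4 class with two slots
setClass("Point", slots = c(x = "numeric", y = "numeric"))
p <- new("Point", x = 1, y = 2)

slotNames(p)
## [1] "x" "y"

#slots are read with @, not $
p@x
## [1] 1
```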
head(ecoregions@data)
## ECO_CODE ECOREGION PROV_CODE
## 0 20192 Agulhas Bank 51
## 1 20053 Aleutian Islands 10
## 2 20072 Amazonia 13
## 3 20194 Amsterdam-St Paul 52
## 4 20228 Amundsen/Bellingshausen Sea 61
## 5 20109 Andaman and Nicobar Islands 24
## PROVINCE RLM_CODE REALM
## 0 Agulhas 10 Temperate Southern Africa
## 1 Cold Temperate Northeast Pacific 3 Temperate Northern Pacific
## 2 North Brazil Shelf 4 Tropical Atlantic
## 3 Amsterdam-St Paul 10 Temperate Southern Africa
## 4 Continental High Antarctic 12 Southern Ocean
## 5 Andaman 5 Western Indo-Pacific
## ALT_CODE ECO_CODE_X Lat_Zone
## 0 189 192 Temperate
## 1 50 53 Temperate
## 2 64 72 Tropical
## 3 191 194 Temperate
## 4 209 228 Polar
## 5 105 109 Tropical
Here we see the ecoregion names and some other properties of them. Other slots are filled with information that defines polygons, etc.
So, to make this into a data frame for mapping, we have to work a little bit of magic.
First, we need to turn those polygons into a data frame of points. There’s a function for this in broom called tidy, and we need to tell it which column separates the regions.
library(broom)
ecoregions_points = tidy(ecoregions, region="ECOREGION")
head(ecoregions_points)
## long lat order hole piece group id
## 1 20.46950 43.05329 1 FALSE 1 Adriatic Sea.1 Adriatic Sea
## 2 20.96008 40.52461 2 FALSE 1 Adriatic Sea.1 Adriatic Sea
## 3 20.90520 40.49428 3 FALSE 1 Adriatic Sea.1 Adriatic Sea
## 4 19.68136 40.52051 4 FALSE 1 Adriatic Sea.1 Adriatic Sea
## 5 19.55918 40.53546 5 FALSE 1 Adriatic Sea.1 Adriatic Sea
## 6 19.38904 40.57329 6 FALSE 1 Adriatic Sea.1 Adriatic Sea
And now the last step - a join! We need to join the original data defining names of ecoregions with the data defining their borders.
ecoregions_df = inner_join(ecoregions_points, ecoregions@data,
by=c("id" = "ECOREGION")) %>%
rename(ECOREGION = id)
## Warning: Column `id`/`ECOREGION` joining character vector and factor,
## coercing into character vector
Voila! We have ecoregions in a data frame format! Now let’s plot it. Note, I’m using group here, as group fixes problems with the map wrapping around both sides of the screen. Try ECOREGION if you don’t believe me.
ggplot(data=ecoregions_df,
mapping = aes(x = long, y = lat, group = group)) +
geom_polygon()
OK, we have some polygons - now what are we going to do with them? One use is to visualize climate change over time. I’ve been working on a project to look at climate change in Ecoregions that contain kelp. So, for each Ecoregion, I’ve gotten all of the temperature data from 1950 to the present from http://www.metoffice.gov.uk/hadobs/hadisst/data/download.html - a great source for SST data.
sst_data <- read.csv("./data/hadsst_regions.csv",
stringsAsFactors=FALSE)
head(sst_data)
## ECOREGION DateName tempC Year
## 1 Agulhas Bank 1950.01.16 21.137959 1950
## 2 Aleutian Islands 1950.01.16 3.909404 1950
## 3 Bassian 1950.01.16 14.309411 1950
## 4 Beaufort Sea - continental coast and shelf 1950.01.16 -1.800000 1950
## 5 Cape Howe 1950.01.16 18.455363 1950
## 6 Celtic Seas 1950.01.16 10.399248 1950
How can we use this data to see climate change on a map? There are a lot of methods. One simple one is to visualize change by decade. First, we have to use a quick trick to calculate decades - divide by ten, round to a whole number, and then multiply by ten.
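The arithmetic on a few sample years:

```r
years <- c(1950, 1953, 1957, 1961)

#divide by ten, round, multiply by ten
round(years / 10) * 10
## [1] 1950 1950 1960 1960
```

Note that round() pushes 1957 into the 1960s; if you wanted strict calendar decades instead, floor(years / 10) * 10 would give 1950 for everything through 1959.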
sst_decadal <- sst_data %>%
mutate(Decade = round(Year/10)*10) %>%
group_by(ECOREGION, Decade) %>%
summarise(tempC = mean(tempC, na.rm=T)) %>%
ungroup()
Now, we can look at temperature by decade, but we can’t just look at raw temperature. The color palette would be too spread out, and we couldn’t compare, say, change in the tropics to change in the Arctic. One way that scientists have gotten around this is to look at temperature anomaly - that is, deviation from a long-term mean. So let’s calculate the decadal deviation from the long-term average for each ecoregion.
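The pattern - subtract each group’s own mean - looks like this in miniature, with made-up temperatures for two fake regions:

```r
library(dplyr)

toy_sst <- data.frame(region = rep(c("tropics", "arctic"), each = 3),
                      tempC  = c(26, 27, 28, 0, 1, 2))

#despite wildly different baselines, the anomalies come out
#directly comparable: -1, 0, 1 for both regions
toy_sst %>%
  group_by(region) %>%
  mutate(anomaly = tempC - mean(tempC)) %>%
  ungroup()
```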
sst_decadal <- sst_decadal %>%
group_by(ECOREGION) %>%
mutate(tempC_anomoly = tempC - mean(tempC)) %>%
ungroup()
OK, now we’ve got something worth plotting!
There are a lot of ways we can visualize this. We can look at long-term trends, of course, and even treat each region as a single data point to get an average long-term trend.
ggplot(data=sst_decadal, mapping=aes(x=Decade, y=tempC_anomoly)) +
geom_line(mapping=aes(group=ECOREGION), colour="grey") +
theme_bw() +
stat_smooth(method="lm", color="red", fill=NA) +
ylab("Decadal Temperature Anomaly")
But how do we put this information on a map so that we can truly see it?
As we did with states, we can join the temperature data to the marine ecoregion data.
sst_map_data <- inner_join(sst_decadal, ecoregions_df)
## Joining, by = "ECOREGION"
Once that is done, we can use this decadal anomaly as a fill, with facets to show different decades.
ggplot(data=sst_map_data,
mapping = aes(x = long, y = lat,
group = group, fill = tempC_anomoly)) +
borders("world", lwd=0.1, colour="black") +
geom_polygon(alpha=0.9) +
facet_wrap(~Decade) +
scale_fill_gradient(low = "blue", high = "red") +
theme_bw(base_size=14) +
coord_equal()
It’s a different way of looking at the same data. But what do you see? What are things we can do to make it clearer?
Let’s go back to our TB data. Here, we’re going to use a shapefile of international borders.
world_shapefile <- readOGR("./data/TM_WORLD_BORDERS_SIMPL-0.3")
## OGR data source with driver: ESRI Shapefile
## Source: "./data/TM_WORLD_BORDERS_SIMPL-0.3", layer: "TM_WORLD_BORDERS_SIMPL-0.3"
## with 246 features
## It has 11 fields
## Integer64 fields read as strings: POP2005
What’s inside?
plot(world_shapefile)
head(world_shapefile@data)
## FIPS ISO2 ISO3 UN NAME AREA POP2005 REGION SUBREGION
## 0 AC AG ATG 28 Antigua and Barbuda 44 83039 19 29
## 1 AG DZ DZA 12 Algeria 238174 32854159 2 15
## 2 AJ AZ AZE 31 Azerbaijan 8260 8352021 142 145
## 3 AL AL ALB 8 Albania 2740 3153731 150 39
## 4 AM AM ARM 51 Armenia 2820 3017661 142 145
## 5 AO AO AGO 24 Angola 124670 16095214 2 17
## LON LAT
## 0 -61.783 17.078
## 1 2.632 28.163
## 2 47.395 40.430
## 3 20.068 41.143
## 4 44.563 40.534
## 5 17.544 -12.296
Oh neat! Notice the ISO2 and ISO3 columns? Those are designed to match the standardized columns in the tb_world data! Once we turn this into a data frame, it should be a snap to merge! Let’s make the same plot as we did previously, but now with the shapefile.
So, TB in 2015!
#get a map
worldmap_shp <- tidy(world_shapefile, region="NAME")
worldmap_shp_df <- left_join(worldmap_shp, world_shapefile@data,
by = c("id" = "NAME")) %>%
rename(iso2 = ISO2, iso3 = ISO3)
## Warning: Column `id`/`NAME` joining character vector and factor, coercing
## into character vector
#filter to 2015
tb_2015 <- tb_world %>% filter(year == 2015)
#join
incidence_df <- left_join(worldmap_shp_df, tb_2015)
## Joining, by = c("iso2", "iso3")
## Warning: Column `iso2` joining factor and character vector, coercing into
## character vector
## Warning: Column `iso3` joining factor and character vector, coercing into
## character vector
#plot
ggplot(incidence_df, aes(x = long, y = lat,
group = group, fill=est_incidences)) +
geom_polygon() +
scale_fill_gradient(low = "blue", high = "red")
Let’s look at mortality in 2000
#get a map
worldmap_shp <- tidy(world_shapefile, region="NAME")
worldmap_shp_df <- ____(worldmap_shp, world_shapefile@data,
by = c("id" = "NAME")) %>%
rename(iso2 = ISO2, iso3 = ISO3)
#filter to 2000
tb_2000 <- tb_world %>% filter(year == ____)
#join
incidence_df_2000 <- left_join(____, ____)
#plot
ggplot(incidence_df_2000, aes(x = ____, y = ____,
group = group, fill=est_mortality)) +
geom_polygon() +
scale_fill_gradient(____)
Now let’s look at incidence per 100K, but only by interval, in 2016 - feel free to do something different with the fill.
#get a map
worldmap_shp <- ____(____, ____="NAME")
worldmap_shp_df <- ____(____, world_shapefile@____,
by = c("id" = "NAME")) %>%
rename(iso2 = ISO2, iso3 = ISO3)
#filter to 2016
tb_2016 <- tb_world %>% ____(year == ____)
#join
incidence_df_2016 <- left_join(____, ____)
#plot
ggplot(incidence_df_2016, aes(x = ____, y = ____,
group = ____,
fill=____(est_incidences_per_100K_people,5))) +
____() +
scale_fill_brewer(____ = "YlOrBr",
____ = guide_legend("Incidences per 100K"))
Use the shapefile maps for this.
Make maps that compare total incidence to incidence per 100K averaged across all years.
Do you see different trends in incidence over time?